1 Background

This file describes the preliminary analyses of three test concepts in the QLVLnewscorpora: penis, inleiding & hart. The concepts were selected from the full list of concepts (N = 433) that I collected from WordNet, Van Dale and DLP2. Information about the full set of concepts is available here.

2 Model parameters

At this moment, parameter selection is based on observations in Mariana's analyses of nouns & verbs, as well as comments in the parameters google doc. The following parameter settings were used to construct token models:

parameter name           FOC                      SOC
Definition target type   lemma/pos                lemma/pos
Window size              fixed: 10                fixed: 4
Boundaries               sentence/none            none
cw selection: strategy   local/global             global
cw selection: settings   local:                   nav top-5000
                           * nav with freq > 200
                           * collfreq = 3
                           * ppmi > 1
                           * llr None or > 1
                         global:
                           * nav top-5000
Weighting                ppmi                     none

Of these, I plan to vary the boundaries (default: sentence) and the context word selection settings for FOC's. Specifically, I will compare implementing an LLR filter or not within the "local"1 strategy, as well as a local versus a global2 strategy. In the latter case, all top-5000 nav context words will be considered.
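For reference, the PPMI weighting applied to the FOC dimensions can be sketched as follows. This is a minimal Python sketch on a toy count matrix, not the actual model-building code; the real vectors are built from the corpus with the settings above.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI weighting of a co-occurrence count matrix.

    Rows are target types, columns are context words."""
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)
    col_totals = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row_totals * col_totals))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts get weight 0
    return np.maximum(pmi, 0.0)    # keep only positive associations

# toy example: 2 target types, 3 context words
counts = np.array([[10., 0., 3.],
                   [2., 8., 1.]])
weights = ppmi(counts)
```

The ppmi > 1 filter in the table then corresponds to keeping only the cells of this matrix whose weight exceeds 1.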

3 Concept 1: penis

This concept was selected because it's a difficult one, with many variables (N = 17, excluding constructions) and varying frequencies per variable.

variant frequency
ding/noun 80601
fluit/noun 1447
jongeheer/noun 105
lid/noun 107912
lul/noun 1155
mannelijkheid/noun 459
penis/noun 1252
piemel/noun 372
pik/noun 451
pisser/noun 4
plasser/noun 18
potlood/noun 1504
sjarel/noun 6
snikkel/noun 18
speer/noun 1217
tampeloeres/noun 1
zwengel/noun 42

This causes two problems for the models & analysis:

  • Some variants are not frequent enough for the token models3. As a result, these variants are not returned by the models. The variants that are too infrequent are: pisser/noun (N = 4), plasser/noun (N = 18), sjarel/noun (N = 6), snikkel/noun (N = 18) and tampeloeres/noun (N = 1).
  • Some variants are too frequent (specifically the polysemous variants ding and lid), which causes computational issues. In addition, even if all the tokens for these variants would be modelled, further steps in the analyses (e.g. clustering) would be problematic too, as the results would be biased towards the highly frequent variants.

A possible solution for the latter problem is to only sample the relevant tokens for the highly frequent types. This can be done in two ways:

  • determining the relevant context words for all variants for the penis-concept
  • determining the relevant context words for the most prototypical variant for the penis-concept (cf. cue-validity). This would be the variant penis. While this strategy might be easier to implement, as penis is not a highly polysemous word, it may4 also be dangerous because you run the risk of excluding context words that only occur in particular contextual settings (e.g. jocular language).
    Note that it is at this point an open question to what extent this strategy will be necessary (and feasible) for all the concepts in the dataset. In addition, using this strategy implies that disambiguation needs to be done before constructing the final token model for all the variants. More specifically, we first need to figure out which model and which clustering algorithm gives the best semantic analysis of the concept if only the non-problematic variants are included. As a second step, we can then select the semantic space of the token cloud resulting from the best model to determine which FOC's can be considered as candidate FOC's for the problematic variants ding and lid.5

3.1 Selecting context words

3.1.1 Strategies for finding the best model

To find a way of extracting context words for the problematic variants, we need a token model for the non-problematic ones that performs well. The best model would be a model that (1) has a good fit to the data (to avoid artificial effects, e.g. regional differences) and (2) has a (relatively) clear semantic region (or branch) where most observations for the target concept are located (precision), while out-of-concept tokens are located somewhere else (recall). As in other studies in the NephoSem-project, determining the best model is not straightforward. There are a number of procedures that can be considered:

  • manual disambiguation of all (or some of) the tokens (cf. Mariana's raters)
  • manual inspection of the token clouds (but note the precision vs. recall trade-off)
  • automatic disambiguation by also overlaying token vectors for an associated word6 of out-of-concept senses of polysemous items (e.g. papier or pen for potlood). This may complicate the analysis as previous studies have shown that models perform better on certain tasks (e.g. synonyms or associated items) depending on the window size that is used. We can hopefully alleviate this problem by also including other monosemous variants to select context words (e.g. piemel, lul).
  • automatic disambiguation excluding tokens that have particular context words (cf. Dirk's example of scherp for potlood)
  • separation indices
  • manual inspection of frequent context words in clusters, excluding clusters that clearly don't have the target meaning
  • ...

3.1.2 Models

So far, models with and without sentence boundaries have been constructed. All token models have the following settings:

parameter name           FOC                      SOC
Definition target type   lemma/pos                lemma/pos
Window size              fixed: 10                fixed: 4
Boundaries               sentence/none            none
cw selection: strategy   local/global             global
cw selection: settings   local:                   nav top-5000
                           * nav with freq > 200
                           * collfreq = 3
                           * ppmi > 1
                           * llr None or > 1
                         global:
                           * nav top-5000
Weighting                ppmi                     none

You can find a shiny-app to explore the models that have been analyzed so far here.

3.1.2.1 With sentence boundaries

The following models only consider context words within the same sentence.

3.1.2.1.1 t-SNE-models

The t-SNE-solutions additionally vary according to two parameters:

  • Number of runs used to calculate the solution: 1000 or 5000. This is particularly useful for this dataset because the variants have very unequal frequencies (some are 9 times more frequent than others), so a stable solution may not be reached as fast.
  • perplexity: 10, 20, 30, 50.

Overall, it looks like the more stable models are the ones with more runs and perplexity 30. Models with very low perplexity (N = 10) look like they have too many small clusters. Choosing other settings than 'lemma' for the colors in the model plot shows that none of the lectal variables in the data seem to play a role.
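As an illustration of the perplexity sweep (not the actual model runs), a minimal Python sketch with scikit-learn, using random stand-in vectors in place of the real token vectors:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for the real token vectors (rows = tokens, columns = SOC dimensions)
tokens = rng.normal(size=(200, 50))

# sweep some of the perplexity values used in the report
solutions = {}
for perplexity in (10, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=0)
    solutions[perplexity] = tsne.fit_transform(tokens)
```

Fixing random_state makes individual runs reproducible, which helps when comparing solutions across perplexity settings.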

3.1.2.1.2 NMDS-models

I have tried four NMDS-solutions so far:

  1. k = 2, max. number of random restarts = 20⁷
  2. increasing the number of restarts to 500
  3. increasing both k and the restarts: k = 5, max number of random restarts = 100
  4. increasing the restarts some more: k = 5, max. number of random restarts = 250

The first NMDS solution is really bad: it has a high stress value (> 0.28) and it did not converge. The second solution ran for over twelve hours and was only at trial 178, with stress values comparable to the first solution (at this point, I killed the process). The third and fourth solutions are the best ones so far, with in both cases a stress value of 0.1334 (for the same trial), but still no convergence. The second dimension may be the one we're after, but it's not the case that all variants with the target meaning are at the bottom of the plot, nor that all out-of-concept variants are at the top.

Since we're running into problems with the NMDS-models, I analyzed where the problematic tokens are located. I used goodness() from library(vegan) to obtain a goodness-of-fit-value per token:

goodness() finds a goodness of fit statistic for observations (points). This is defined so that sum of squared values is equal to squared stress. Large values indicate poor fit.

This plot shows the results for the fourth NMDS solution. The less problematic tokens (lighter colours) are located at the top left of the plot, where the observations for fluit and potlood are located (typically with their prototypical meaning), as well as the tokens for penis in the middle. Perhaps the model has more trouble with variants that are more polysemous.
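The NMDS fitting itself is done with vegan's metaMDS() in R; purely as an illustration of the technique, a rough Python equivalent of non-metric MDS on a precomputed distance matrix (with random stand-in data) looks like this:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
tokens = rng.normal(size=(60, 20))   # stand-in token vectors
dists = squareform(pdist(tokens))    # Euclidean distance matrix

# non-metric MDS with several random restarts; stress_ is the fit criterion
nmds = MDS(n_components=2, metric=False, n_init=10,
           dissimilarity="precomputed", random_state=0)
embedding = nmds.fit_transform(dists)
```

Unlike vegan, scikit-learn does not report a per-point goodness-of-fit out of the box, which is one reason to keep this step in R.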

3.1.2.1.3 Hierarchical clustering

Finally, I also used agglomerative hierarchical clustering (Ward's method) to analyze these data. Rather than choosing a number of clusters beforehand, I considered between 2 and 50 clusters, basing the optimal number of clusters on the silhouette width of the clusters. The optimal number of clusters is 45 in these data (sw = 0.358), with solutions that have 15 clusters or more reaching acceptable results (sw > 0.2). Obviously, solutions with 15 or more clusters are difficult to interpret, but for the purpose of illustration, this plot shows the solution with 15 clusters (isolate one cluster by double-clicking on its symbol in the legend). The x- and y-axis show the results from the t-SNE solution with perplexity = 30 and 5000 runs. Some of the clusters make a lot of sense, especially when they're also the ones that are separated by t-SNE as well (e.g. the ouwe lul-cluster at the right of the plot in magenta). Others are more diverse (e.g. clusters 2 and 5).

With fewer clusters, only some of the clearer divisions are (obviously) retained. Cluster 3 in the solution above, for instance, has body parts as context words. In the solution below, it is merged into the more diverse cluster 2.
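The silhouette-based selection of the number of clusters can be sketched as follows. This is a Python sketch with simulated data standing in for the token vectors; the actual analyses use Ward clustering in R.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# simulated stand-in vectors with three well-separated groups
tokens = np.vstack([rng.normal(loc=c, size=(40, 10)) for c in (0.0, 3.0, 6.0)])

# build the Ward tree once, then cut it at each candidate number of clusters
tree = linkage(tokens, method="ward")
scores = {}
for k in range(2, 16):
    labels = fcluster(tree, t=k, criterion="maxclust")
    scores[k] = silhouette_score(tokens, labels)
best_k = max(scores, key=scores.get)
```

Because the tree is built once and only cut repeatedly, sweeping 2-50 cluster solutions stays cheap even for large token sets.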

3.1.2.2 Without sentence bounds

The following models also consider context words outside of the sentence.

3.1.2.2.1 t-SNE-models

The t-SNE-solutions again vary according to two parameters:

  • Number of runs used to calculate the solution: 1000 or 5000. This is particularly useful for this dataset because the variants have very unequal frequencies (some are 9 times more frequent than others), so a stable solution may not be reached as fast.
  • perplexity: 10, 20, 30, 50.

It looks like the models have clearer clusters from perplexity = 30 onwards (both for 1000 and 5000 runs). Models with very low perplexity (N = 10/20) again look like they have too many small clusters. Interestingly, with 1000 runs and perplexity 30, the 'oude lul'-cluster is moved far away (left of the plot) from the uses of 'lul' referring to the target concept (right side of the plot). All in all, this looks like a very good solution, with the target concept mostly in the bottom right quadrant of the plot (though it's not completely perfect).
Choosing other settings than 'lemma' for the colors in the model plot shows that none of the lectal variables in the data seem to play a role.

3.1.2.2.2 NMDS

In contrast with the first model, I no longer constructed an NMDS solution for k = 2 dimensions, as this caused problems. The NMDS solutions again throw a convergence error after finishing. This indicates that we can't be sure that the solution we have is a global optimum rather than a local one. It may be necessary to change more parameter settings (specifically the convergence criteria), as the metaMDS vignette states:

In addition to too slack convergence criteria and too low number of random starts, wrong number of dimensions (argument k) is the most common reason for not finding convergent solutions.

In addition, some resources argue that using the default Bray distance is inappropriate for non-ecological data. It may be better to use Euclidean distances instead.8

The solutions for nmds nruns = 100 and 250 are identical. This probably has to do with the fact that they choose the same run as the best solution. The relevant tokens are in the top part of the plot, although the clusters are not perfect.
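To make the Bray vs. Euclidean contrast concrete, here is a small Python example comparing the two measures on two toy non-negative vectors (stand-ins for ppmi-weighted context profiles):

```python
from scipy.spatial.distance import braycurtis, euclidean

# two toy non-negative token vectors
a = [1.0, 0.0, 2.0, 0.5]
b = [0.5, 1.0, 2.0, 0.0]

bc = braycurtis(a, b)   # sum|a-b| / sum|a+b|: bounded in [0, 1], a dissimilarity
eu = euclidean(a, b)    # a proper distance metric
```

Bray-Curtis is only defined for non-negative data and is a dissimilarity rather than a metric, which is exactly the concern raised in footnote 8.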

3.1.2.2.3 Hierarchical clustering

For the hierarchical clustering algorithm, I again considered between 2 and 50 clusters, basing the optimal number of clusters on the silhouette width of the clusters. The optimal number of clusters is 49 in these data (sw = 0.332), with solutions that have 16 clusters or more reaching acceptable results (sw > 0.2). Again, this plot shows the solution with 16 clusters (isolate one cluster by double-clicking on its symbol in the legend). The x- and y-axis show the results from the t-SNE solution without sentence bounds with perplexity = 30 and 1000 runs. In most cases, context words explain why certain clusters are formed (e.g. cluster 12 contains tokens having to do with measurements; for speer these are mostly tokens related to sports; cluster 14 centres on small/large).

3.1.3 Summarizing the workflow and going from here

To summarize, the workflow so far consists of the following steps:

  1. Construct two token models, only varying the parameter 'boundary' for the selection of first-order context words (either only including context words within the sentence boundaries or including all the candidate context words within the specified window size)
  2. Analyze these models with t-SNE, NMDS and hierarchical clustering
  • For t-SNE, two additional parameters were varied:
    • perplexity (how many neighbours each token is expected to have)
    • the number of iterations (generally, the rule is that after enough iterations, the model only changes marginally)
    • other options are to vary the number of dimensions (currently, I'm only looking at solutions for two dimensions, in contrast with the NMDS-solutions), to vary perplexity even more (do we really think that the maximum number of neighbours for e.g. the potlood-tokens is only 50 if there are N = 1085 tokens for potlood in total?) and to use PCA before constructing the t-SNE solution (especially since we have a large dataset, as recommended here).
  • For NMDS, two parameters were varied:
    • the number of dimensions k: 2 or 5. Note that I used k = 5 for all the solutions in the nobound model because the algorithm was exceptionally slow with only 2 dimensions, which may suggest that it has trouble reducing the dimensionality of the data to k = 2 (I think).
    • the number of random starts (used to make sure you truly have the best global solution)
    • many other parameters can be varied here, including the dissimilarity measure used (default: Bray-Curtis) and the convergence criterion
    • calculating NMDS is very slow and shows convergence errors in every solution. One solution may be to use parallel processing with the parallel-argument.
  • For hierarchical clustering, I checked in each case what the optimal number of clusters would be for the data (with max nr of clusters <= 50). This number is always larger than 40, which is probably too much to handle. However, solutions with fewer clusters also seem to make some sense.
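The PCA-before-t-SNE option mentioned in the t-SNE notes above can be sketched as follows (Python/scikit-learn, random stand-in vectors; the choice of 50 PCA components is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
tokens = rng.normal(size=(300, 200))   # stand-in high-dimensional token vectors

# reduce dimensionality with PCA first, then run t-SNE on the reduced data
reduced = PCA(n_components=50, random_state=0).fit_transform(tokens)
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(reduced)
```

The PCA step mainly buys speed and noise reduction on large datasets; the final two-dimensional embedding still comes from t-SNE.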

The end goal of the current analysis is to find a set of candidate context words to sample tokens for the words that are currently not included (e.g. ding, lid), even though they are also synonyms for our target concept. So the question at this moment is: do we have enough information to make an informed choice about the context words that need to be present in the context of tokens included in this sample?

I think that we do: in my opinion, we can combine the best t-SNE solution for each model with the clustering result (possibly the 15 and 16 clusters from the two models above) to select tokens that most likely refer to the target concept and then extract the (most frequent) context words from them. The process can take the following form:

  1. Randomly sample 50 (?) tokens from each of the clusters in the best cluster solutions (i.e. the 15 clusters in the first model and the 16 clusters in the second model). The best cluster solutions here are defined as the solutions with the minimal amount of clusters necessary to obtain an average silhouette value of at least 0.2.
  2. Manually analyze each of these 31-by-50 tokens, indicating whether they are out of concept or not. Also record the lemma of the target variant for each of the tokens. In most cases, I don't think it would be necessary to indicate in more detail what the meaning of the token is if it's out of concept. Only for the lexical items with high cue validity (penis, lul) may it be interesting to indicate that we possibly have tokens with idiomatic expressions (e.g. the ouwe lul-cluster).
  3. Calculate whether the majority of the tokens per cluster are in-concept or not.
  4. Calculate the (most frequent) context words for the in-concept and out-of-concept tokens and combine this information with information about the most frequent lemmas per cluster. This should give us a good idea of the meaning of each cluster (although there will probably also be clusters that are very diverse). Use this information to come up with a list of context words that point to tokens being either 'very likely in-concept' or 'very likely out-of-concept'.
  5. Divide the tokens for the variants that are currently excluded into three groups:
  • very likely in-concept
  • very likely out-of-concept
  • other

I expect that we may find a relatively large number of 'other' tokens. One solution would be to also take a sample of these and analyze them manually. I don't think that there will be many tokens in this group that do refer to the target word, unless they are part of a fixed expression. In the latter case, it is an open question whether we should include them in the final analysis, because a variant used in a fixed expression is not necessarily synonymous (or interchangeable) with the other variants in that context.

  6. Begin working on the actual token model that will be used in the analysis :)
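Steps 4-5 of the plan could take roughly this form once the context-word lists exist. This is a Python sketch; the cue lists below are purely hypothetical placeholders, to be replaced by the words actually extracted from the clusters.

```python
# hypothetical context-word lists, to be derived from the cluster analysis
in_concept_cues = {"erectie", "geslachtsdeel"}
out_concept_cues = {"vereniging", "partij"}

def classify_token(context_words):
    """Assign a token to one of the three groups based on its context words."""
    words = set(context_words)
    hits_in = bool(words & in_concept_cues)
    hits_out = bool(words & out_concept_cues)
    if hits_in and not hits_out:
        return "very likely in-concept"
    if hits_out and not hits_in:
        return "very likely out-of-concept"
    return "other"

group = classify_token(["het", "lid", "van", "de", "vereniging"])
```

Tokens whose contexts match neither list (or both) fall into 'other', which is exactly the residue group discussed above.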

3.1.4 Things to keep in mind to construct the final token model

Especially for this next part of the analysis, it will be important to not just find any solution that looks acceptable, but to come up with the best solution (if that exists). This will take two forms.

First, it may be necessary to vary more parameters than just the boundary-parameter in the token model, and to come up with the best possible dimensionality reduction solution, including determining the most appropriate parameter settings for the algorithms (cf. the NMDS-remarks above). While this is not so difficult to do, the difficult part will be deciding which model is the best one. Some questions we can easily answer are:

How different are the models really and how do they differ?
We can use procrustes analysis to make pairwise comparisons of the models. Some notes:

  • According to most of what I've read, procrustes analysis is mostly used with (n-)MDS solutions. I'm assuming it can be used for t-SNE too (the principle underlying the method should also work for t-SNE).
  • I don't know if it's a good idea to directly compare t-SNE solutions with NMDS solutions. A procrustes analysis tries to find out how much rotation is necessary to get one solution as close as possible to the other solution. However, the NMDS and t-SNE algorithms have somewhat different goals (in the sense that t-SNE is particularly suited for finding compact clusters). For this reason, I'm not sure it's a good idea to compare them directly to each other.
  • the procrustes()-function in library(vegan) also comes with a function protest() that shows whether the difference between two models is significantly large. This may be an interesting result to look at.

I've already tried out a procrustes analysis on two solutions for the first and second model described above (t-SNE, perplexity = 30, nruns = 5000). The plot below shows the residuals for the procrustes analysis, with lighter colors indicating a larger difference between model 2 and model 1.
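The procrustes comparison itself (done here with vegan's procrustes() in R) can be sketched in Python with scipy, using random stand-in coordinates. scipy's procrustes() standardizes both configurations and returns a disparity; per-token residuals can then be computed from the aligned matrices:

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(4)
sol_a = rng.normal(size=(100, 2))          # e.g. t-SNE solution for model 1
theta = 0.5
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
sol_b = sol_a @ rot + 0.1                  # rotated + shifted copy of solution 1

# procrustes() removes translation, scaling and rotation, then measures what's left
mtx1, mtx2, disparity = procrustes(sol_a, sol_b)
residuals = np.sqrt(((mtx1 - mtx2) ** 2).sum(axis=1))  # per-token difference
```

Since sol_b is just a rigid transformation of sol_a here, the disparity is (numerically) zero; for two genuinely different models the per-token residuals are what gets plotted.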

How stable are the models for other concepts?
Of course, so far I've been focusing on a single concept, even though I selected three test concepts. While I expected this concept to be a pretty difficult one, the results are interpretable. Perhaps it would be useful to construct token models for the other two concepts before turning to the final model for penis, as this may reveal other problems that are not yet obvious in the current data.

Second, another question that we don't have an answer to yet is how scalable the procedure used so far is or, put differently, how necessary it is to do the intermediate step of constructing a tokenmodel for a subset of the variants. Analyzing the other concepts might shed some more light on this question.


  1. Defined in the google doc as: "potentially all words within the specified window span around the target token". Note that my definition of "local" is not extreme, as I am only including nav's with a frequency of > 200. However, it is local in the sense that potentially all these words can be considered (N = 37807)

  2. Defined in the google doc as "fixed set of context words, same for all target types". Here, the 5000 most frequent nav's.

  3. This may also be related to the parameter settings that are used, e.g. if no nav's of frequency > 200 occur with the target type in a particular token/observation, this type is not included in the model.

  4. Note that while it may be dangerous to use this strategy, it doesn't have to be. We just don't know yet.

  5. An alternative strategy may be to semasiologically analyze these variants. Specifically for ding this could be a fruitful approach, because this variant is highly polysemous and is also included as a high-level word in the WordNet-taxonomies. It is not known whether the penis-meaning of ding would show up in such an analysis.

  6. We could select high-frequency candidates from the association data of Gert Storms for this purpose.

  7. This determines how many times the algorithm can try to find a stable solution. If it doesn't succeed in the specified number of random starts, there is no successful convergence.

  8. e.g. this one: http://strata.uga.edu/8370/lecturenotes/multidimensionalScaling.html, but see this one, which argues that one of the disadvantages of Bray dissimilarity is that it uses raw counts (not a problem in our case, because every token has exactly one measurement point for each context word): http://www.econ.upf.edu/~michael/stanford/maeb5.pdf. Another problem is that Bray-Curtis values aren't really distances but dissimilarities; also see Borcard et al., Numerical ecology with R (with Legendre).